Prosper Loan Data Exploration by Mark Mavromatis

Clean up column data types

Some data has been parsed incorrectly: i.e. dates as factors, prosper scores as integers (instead of factors), etc. We should clean this up so that the data is easier to work with.

Basic Statistics

## Total # Rows:  113937
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2005-11-15" "2008-10-02" "2012-06-26" "2011-07-21" "2013-09-18" 
##         Max. 
## "2014-03-12"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

How is Prosper data changing over time? Is the peer-to-peer lending business growing, shrinking, or staying the same? First let’s look at # of new loans issued daily:

Some observations about this data:

  • Number of loans is definitely increasing.
  • There is a large gap where no loans were issued between late 2008 and mid 2009.

What could be the reason for this gap?

https://en.wikipedia.org/wiki/Prosper_Marketplace#Cease_and_desist_order

On November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper. Due primarily to the novel nature of the peer-to-peer lending models, the SEC, after review, now treats all peer-to-peer lending transactions as sales of securities and requires that all platforms register with the SEC.

Ok this explains the gap.

How are loan amounts changing over time?

Daily loan amounts

Both the number of loans and loan amounts (aside from the interruption in 2008) have increased How about total loaned amounts?

Plot total loan amounts over time

The daily amounts have peaks and valleys that make the plots tough to look at. Using the quarterly date column, let’s see how data changes across quarters instead of each day:

Total amount loaned each day has increased dramatically since day 1 of the Prosper service. The last column shows a slight decrease but this is likely due to an incomplete quarter. The last day for loan data was March 12, 2014 with 19 days left in the quarter.

What is the status of Prosper Loans if we check across origination dates?

So many (12!) statuses! It is difficult to read so many different statuses on a stacked bar chart. Let’s redraw this using a new “Visual Status” field that summarizes status into positive (current, paid off, prepaid) and negative (late payments, loan default) buckets.

Defaults as a percentage of all loans seem to be decreasing. This is a good sign.

One point of interest is that there are many prepaid (Status is complete, Close Date < Loan Maturity Date) loans in Prosper where borrowers paid off the loan ahead of schedule. How many of the completed loans are prepaid versus paid on schedule?

Most of the completed loans, even dating back to 2008, are prepaid. This is troubling for Prosper investors (loan buyers) who may expect a steady stream of payments up until the maturity of the loan.

Loans may be prepaid for a variety of reasons including reduction in interest rates, better loan terms offered elsewhere, selling personal assets to pay debt, or general improvement in personal wealth. Without more information, it is hard to identify why Prosper borrowers are prepaying their loans in such high numbers.

Univariate Plots Section

What are the common loan sizes for Prosper Loans?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Many loans are quite small (less than US$10,000). The largest loan offered by Prosper is $35,000 (confirmed on their website).

Many of the loans are in round units of $5,000. The peaks in this histogram are at $5k, $10k, $15k, $20k, and $25k. $35k loans are new (see chart in the final Plots section) so the number of these loans is still quite small.

What kinds of loans are underwritten by Prosper? Let’s check ListingCategory:

##     0     1     2     3     4     5     6     7     8     9    10    11 
## 16965 58308  7433  7189  2395   756  2572 10494   199    85    91   217 
##    12    13    14    15    16    17    18    19    20 
##    59  1996   876  1522   304    52   885   768   771

Debt consolidation (1) is by far the most popular type of loan. The 2nd and 3rd most popular listing categories are “Not Available” (0) and “Other” (7) which is not very descriptive. Home Improvement (2) and Business (3) loans round out the top 5 categories.

How about borrowers and their credit? What kinds of borrowers are more common in the Prosper Loans marketplace?

Prosper changed their credit rating field in 2009 from “CreditGrade” to “ProsperRating”. In order to effectively plot data across all dates, I have merged the two credit rating fields into a single “UdacityRating” field.

There is a normal distribution of credit ratings with most loans occurring at medium credit quality.

What is the distribution of interest rates on these loans?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

Interest rates range from 0.6% to 51.2%! Mean interest rate is 21.9%.

51.2%? This seems exceptionally high. What kinds of borrowers have to pay interest of more than 50%?

##       LoanOriginationDate ClosedDate CreditGrade IsBorrowerHomeowner
## 36019          2006-02-27 2006-03-14          HR               False
## 56762          2006-03-21 2006-09-19          HR               False
##       BorrowerAPR LoanOriginalAmount LoanStatus StatedMonthlyIncome
## 36019     0.51229               1500  Completed                2838
## 56762     0.50633               3000  Defaulted                   0

The borrowers of these 2 loans had exceptionally bad (or no) credit. The credit grade of “HR” refers to “Highest Risk” which is the lowest credit grade. Either there was no credit history or the borrower has a history of defaults.

(Sourced from: http://www.liquisearch.com/prosper_marketplace/evaluation_of_credit_risk/credit_grades )

Additionally, the borrowers were not homeowners and had low stated monthly incomes ($2,838 and $0 respectively). The high risk of lending to these individuals likely led to the high interest rate. The first borrower paid back the loan within a month. The second borrower defaulted within 6 months.

With a large binwidth (5%), the histogram looks similar to a normal distribution. However, if we reduce the BorrowerRate APR binwidth to 1%, the graph shows more widely distributed interest rates.

## [1] "# of Loans with Borrower APR between 35% and 36%:  7693"

Here, the distribution appears wider with multiple peaks including more than 7000 loans between 35 and 36%.

Of loans that are in default, how much of the principal has been paid off?

Most defaults happen with >75% of the principal still left unpaid.

What is the term breakdown for Prosper loans?

36-months is the most popular term length for Prosper Laons. 60-months is the next most popular term.

12-month loans do not appear to be popular. Is this term still offered?

## [1] "Last 12-month loan was offered in:  2013-04-15"
## [1] "Last loan in the system was offered:  2014-03-12"

The last 12-month loan was 11-months before the last loan in this dataset so this term may no longer be offered.

What is the loan status breakdown for all closed loans?

“Charged-off” and “Defaulted” loans technically mean the same thing: The borrower was unable to pay back the loan. Let’s compare the total number of defaulted loans to the number of ‘completed’ loans using nrow().

## [1] "# of Closed loans:  55089"
## [1] "# of Completed (paid-off) loans:  38074"
## [1] "# of Unpaid / Defaulted loans:  17010"

Almost 1/3 of the closed loans are in default! This does not sound good. However, there are also a large number of open loans in the system which are not counted here.

What are the prosper scores for these defaulted loans?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   5.000   5.427   7.000  11.000   10669

Prosper scores for these defaults follow a normal distribution where the mean score is 5.4 right down the middle. (Highest prosper score possible is 11)

For defaults, what % of the loan principal remains unpaid?

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.01884  0.61330  0.81160  0.73710  0.92300  1.00000

The histogram shows that most loans default with most of the principal still unpaid. The average default loan has 75% of the principle outstanding. This is not a good sign. What is the distribution of prosper scores where over 75% of the principal remains outstanding? If Prosper’s scoring algorithms are predicting losses correctly, we should see lower scores in this distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   5.212   7.000  11.000

The scores here follow a similar distribution with the normal distribution centered at 5.2 instead of 5.4. Since Prosper Score combines borrower credit rating with historical Prosper data to create an improved ‘risk score’, one would think this field would better predict large defaults. Has this scoring system become any better at predicting large losses since it was introduced in 2009? Let’s compare pre- and post-2011 loan scores and see.

## [1] "Average time (# days) to default (for defaulted loans)"
## [1] 510.8509

The distributions are very different. Instead of a normal distribution of prosper scores, the post-2013 scores shows a downward trending pattern where higher scores (better loans) experience fewer defaults. This is ideal behavior and indicates that Prosper Score may be improving at predicting default risk.

However, since post-2013 loans are relatively new and the average loan default occurs after 510 days, perhaps this distribution reflects poor quality borrowers who default early in the loan term. As loans mature, many more could default which would affect the above histogram. My personal belief is that when analyzing defaults and probability of defaults, examining defaults on loans issued after October 2012 (511 days before this file was created) are not worth investigating in this particular dataset.

Univariate Analysis

What is the structure of your dataset?

This dataset includes 113,937 loans issued between November 15, 2005 and March 12, 2014. These loans were for amounts of between $1,000 and $35,000 with a mean (average) loan amount of $8,337. There is a gap in dates between October 18, 2008 and May 5, 2009. This gap represents period when the Securities Exchange Commission (SEC) ordered Prosper Marketplace (and other peer-to-peer lenders) to temporarily cease business activities pending regulatory review. This event is described in the Wikipedia entry for Prosper Marketplace:

https://en.wikipedia.org/wiki/Prosper_Marketplace

Each record (row) in the dataset includes different kinds of information.

Borrower Details:

  • Borrower State
  • Occupation
  • Employment Status
  • Employment Status Duration
  • Income
  • Homeowner Status
  • Credit Score Ranges
  • Open / Revolving Credit Lines
  • Credit Inquiries
  • Past Delinquencies

Loan Details

  • Origination Date
  • Term
  • Amount
  • Prosper “Score”
  • Estimated Loss
  • Borrower APR
  • Monthly Payment

Loan Status

  • Status - Current, Completed, Defaulted, etc.
  • Closed Date (if closed)
  • Days Delinquent
  • Customer Payments
  • Interest and Fees (out of total payments)
  • Principal Payments (out of total payments)
  • Net Principal Loss after recoveries (for defaulted loans)

Borrower credit quality is measured in different ways depending on the date of the loan. Key components are borrower credit scores (available for all loans except a few of the oldest loans), existing/open lines of credit, income, and home ownership. These attributes are combined into a single credit quality field which changed midway through the time this data was collected.

Pre-2009:

CreditGrade - Experian ScoreEx rating.From lowest to highest: NC (No Credit), HR (High Risk), E, D, C, B, A, AA

Post-2009:

ProsperRating (Numeric) - From lowest to highest: 1, 2, 3, 4, 5, 6, 7

The old scoring system has 8 levels where the new system has 7. From reviewing web sites related to Prosper Rating, I believe that “NC” (No Credit) and “HR” (High Risk) from the old-style “CreditGrade” field have been combined into a single rating (“HR”) in the new ProsperRating.

References:

http://www.orchardplatform.com/blog/alphabetic-ratings-and-numeric-scores-on-prosper/ http://www.lendacademy.com/prosper-review/

Borrower credit quality is an input to the Prosper Score which appears to be the key variable determining interest rate in today’s Prosper Marketplace loans.

https://www.prosper.com/plp/general-prosper_score/

What is/are the main feature(s) of interest in your dataset?

The main features of interest to me are:

  • BorrowerAPR - Price of the loan for the borrower
  • UdacityRating - Prosper tracks borrower credit quality using two fields: (pre-2009) CreditGrade and (post-2009) ProsperRating. I have combined these two into a single field “UdacityRating” so that we can compare scores across all data.
  • EstimatedLoss - How well does Prosper predict defaults and lost principal with this field?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • LoanStatus - Shows us the loan (late payments, default, completed) status which we can compare to BorrowerAPR and Prosper Rating to examine how well Prosper predicts losses.
  • ProsperScore - Quality of the loan taking into account credit quality and prior Prosper loan trends
  • CreditScoreRangeUpper - High end of credit scores for a particular borrower
  • EstimatedLoss - How well does Prosper estimate default risk / size?
  • NP_NetPrincipalLoss - How much principal remained unpaid after a loan defaulted (and after recoveries)?
  • UdacityInDefault - Added by me to flag defaulted loans in our data set.

Did you create any new variables from existing variables in the dataset?

Yes, I created the following variables:

  • UdacityInDefault - Has the borrower defualted on the loan (1 = yes, 0 = no)
  • UdacityDefaultPrincipalUnpaidPercent - For defaulted loans, what percentage of the loan remains unpaid?
  • UdacityRating - Combination of pre- and post-2009 credit scoring fields (CreditGrade and ProsperRating)
  • UdacityEstimatedLossAmount - Estimated loss amount based on LoanOriginalAmount and EstimatedLoss (%)
  • UdacityMaturityDate - Expected close date of loan based on paid on schedule loan.
  • UdacityIsPrepaid - Was this loan paid fully before the expected maturity date?
  • UdacityLoanStatusVisual - Consolidated loan statuses into 4: OpenPositive (Current), OpenNegative (late), ClosedPositive (Paid on Schedule), ClosedNegative (Defaulted), ClosedPrepaid (Paid early)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I combined two fields (CreditGrade and ProsperRating..numeric.) into a single field: UdacityRating. This will make plots possible across all the data instead of breaking things up into pre-2009 (CreditGrade) and post-2009 (ProsperRating) charts.

I also created a parallel “Visual Status” field (see above) so that I would not have to work with the granular statuses used in this dataset.

I also changed the data types of a few fields to fix parsing issues after loading the CSV file.

Date fields originally got parsed from the CSV file as factors so I had to change them to date fields. This is useful for running calculations against dates (date differences) and ordering dates properly.

Also, listing categories, terms, and rating scores were originally parsed as character data or numeric data. I changed these to factor fields.

I did not notice highly unusual distributions although there was a wide range of Borrower APR interest rates across the data (0.6% - 50%).

Bivariate Plots Section

The key driver for BorrowerAPR appears to be borrower credit rating (UdacityRating). Which columns correlate strongest to this credit rating field?

##     1     2     3     4     5     6     7     8     9    10    11  NA's 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456   144

Credit Score

Credit score appears to have the highest correlation to borrower credit rating. How does this look on a chart?

Aside from a relatively flat line at lower credit score levels, the correlation beteen credit score and credit rating appears quite high. How many Prosper borrowers have a credit score below 500?

## # of Loans (rows): 113937
## Minimum credit score: 19
## # of Loans with <500 credit score borrowers: 1253
## # of loans with 19 credit score borrowers: 724

Many of these loans are associated with borrowers with a credit score of 19! I did not think this was possible!

Around 1% of the borrowers have credit scores less than 500. These loans appear quite rare which may explain why these loans are grouped in the same prosper rating category as typical lower credit scores (500 - 600). If we remove these outliers, does the correlation get stronger?

Yes! When low credit scores (possibly outliers) are removed, credit score becomes a key driver in Udacity’s credit rating system!

Current Delinquencies

Are the number of borrower delinquencies correlated with the borrower’s credit rating?

Delinquencies and credit rating are negatively correlated which is expected. However at 50 delinquencies, there is a huge uptick in Prosper Rating! Is this a significant data point or an outlier?

Let’s query loans with >40 delinquencies and see if this is an isolated or common case?

##        CurrentDelinquencies UdacityRating LoanOriginationDate
## 11902                    41             1          2008-06-11
## 35285                    59             1          2007-01-23
## 40760                    50             1          2007-04-11
## 44477                    57             1          2006-05-18
## 62462                    45             1          2007-07-30
## 83363                    41             1          2006-10-06
## 105566                   40             1          2008-03-10
## 109209                   51             5          2013-12-27
##        LoanOriginalAmount LoanStatus BorrowerAPR Term
## 11902                1000 Chargedoff     0.37453   36
## 35285                1550  Defaulted     0.25723   36
## 40760                1000  Completed     0.27926   36
## 44477                2000 Chargedoff     0.24691   36
## 62462                1000 Chargedoff     0.19986   36
## 83363                4000  Defaulted     0.29776   36
## 105566               1000 Chargedoff     0.37453   36
## 109209              15000    Current     0.17611   36
##        CreditScoreRangeLower ClosedDate StatedMonthlyIncome
## 11902                    520 2009-03-13            1665.667
## 35285                    480 2007-06-24            1969.000
## 40760                    520 2010-04-11            2666.667
## 44477                    480 2008-03-18            6100.000
## 62462                    540 2010-11-29            2500.000
## 83363                    500 2007-05-07            4166.667
## 105566                   520 2010-07-09             600.000
## 109209                   700       <NA>            5000.000

There is one loan in the result set with a Prosper rating of ‘5’. Despite the high (50) delinquencies, he/she has a relatively high credit score (700) compared to other borrowers. In Udacity’s rating algorithm, perhaps credit score is a larger factor than the number of delinquencies. In any case, this data point is an outlier.

Current Credit Lines

Does the number of current credit lines affect credit rating?

At the lower range (0 - 10), # of credit lines appears to affect credit rating. However for borrowers with more than 10 credit lines, there does not seem to be any increase in credit rating.

Employment Duration

How about employment duration in months?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

There does not appear to be much of a correlation here. However an interesting thing happens when we switch the X and Y axis. If we plot Prosper Rating (x) against Employment Duration in months, we see a pattern.

On average, for lower rated borrowers there is correlation between # of months employed and Udacity credit rating. However at around credit score 5, the correlation ends and the trend reverses. Borrowers with the highest scores tend to have been employed for fewer months than medium credit borrowers. Perhaps more successful people tend to switch companies to pursue better (higher paying?) opportunities which means less time employed at a single company.

Prosper Rating and Borrower APR

Continuing from where we left off at the histograms in the univariate section, let’s look at the impact of borrower credit rating on Borrower APR.

Generally, borrower credit rating and interest rate are negatively correlated. Higher rated borrowers receive lower interest rates. However credit rating alone does not determine interest rates as can be seen by ranges of interest rates for any particular credit grade / prosper rating.

Loan Defaults

Loan defaults occur when a borrower is unable to pay back their loan. These can be catastrophic to Prosper investors who count on an income stream from interest payments and their original investment to be returned at loan maturity.

Can Prosper predict loan defaults?

Prosper uses borrow credit reports to predict default rates (and assign appropriate interest rates). How well do credit scores predict loan defaults?

Credit Score and Loan Default %

Similar to the above charts, there is some noise for loans with credit scores less than 500. Let’s remove these outliers and redraw the graph:

This looks ‘better’. As the borrower credit scores increase, loan default % decreases. After a steep decrease until ~680, the default percent stays relatively flat.

Looking at ‘default %’ using the entire Prosper loans dataset is problematic in that both matured loans and new/open loans are included in the data set. If we are trying to calculate a “default percent”, it is not quite ‘correct’ to include loans that have only been issued for a few months. A better approach would be to take all loans on or before a ‘cutoff date’ and evaluate default percentages based on this reduced dataset.

To identify a cutoff date, let’s check the most recent loan date and the average # of days that a default occurs based on defaulted loans’ origination dates and close dates.

## Most recent loan was originated in:  16141
## Average time (# days) to default (for defaulted loans) 510.8509
## Using mean days to default, what is a 'safe' day before which to  look at defaulted loans?  2012-10-17

Using 2012-10-17 as a cutoff date, let’s draw the graph again for >500 credit score loans:

This chart shows a smoother decrease in default percentages for low-middle credit scores. The previous chart showed a sharp decrease in defaults which was likely due to newer (still current) loans being included in the dataset. During the term of these loans, many more will default if previous patterns are consistent. We are still underestimating default rates towards the end of our limited data, but hopefully we capture many of the defaults while using as much of the data as possible.

Prosper Rating and Defaults

How well does the 7-point Prosper Rating score (which takes into account other factors like employment status and income) predict defaults?

Higher Prosper Rating also correlated to lower defaults. However the credit scores provide a much more granular predicting factor for default %s. With only 7 possible Prosper Rating scores, many different borrowers will get grouped into the same score which is why the lowest rated borrows still default less than 40% of the time (as opposed to the 60+% default rate for the lowest credit scores).

Does the EstimatedLoss field correlate with more defaults?

The line graph has a lot of noise. Since EstimatedLoss is a very precise field (4-digits precision), the data is quite scattered. Let’s try rounding to 2-digits (creating buckets) and see if the correlation appears stronger.

This second chart shows a stronger correlation between higher EstimatedLoss predictions and number of actual defaults. Since estimated loss is likely at least partially derived from a borrower’s credit rating, this is not a surprise.

Unlike the previous graphs, here we can see how closely defaults are matching Prosper’s predictions. At first glance, it appears that Prosper’s predictions are proving too conservative. At 0.1% estimated losses, roughly 0.2% of the loans are defaulting. However, estimated loss is a prediction of lost PRINCIPAL, not defaults, so let us run a similar chart with estimated loss plotted against actual principal loss.

For loans where Estimated Losses are <0.2%, actual losses exceed Prosper’s estimates. However, above 0.2%, Prosper’s estimates are higher than actual losses. How does Prosper do overall on predicting daily losses versus the actual principal losses?

This plot paints a different picture. Through the end of 2010, Prosper underestimated the actual losses by a small margin. In 2011 and 2012, the gap between estimated and actual losses increased to about $1M/quarter. This coincides with an increase in loan volumes so perhaps the difference is not so meaningful.

In late 2012, the actual losses appear to dive. However this is likely due to new loans that are still current but may default soon enough (see details above).

Staying on topic of defaults, does BorrowerAPR accurately predict risk of default? Let’s plot default rate against 2-decimal rounded APR loans:

Interesting! Although default rates are higher for extremely high (>40%) interest rates, a larger section of intreset rates (15% - 40%) shows a flat default rate where a third of the loans default.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From my analysis, BorrowerAPR is determined primarily by a borrower’s credit score (and Prosper Rating). Other factors affecting APR possibly include defaults and employment duration. This is difficult to determine by the provided data since credit score itself likely takes into account these factors. The Prosper Rating by itself is enough to estimate a pretty narrow range for APR.

Comparing defaults to paid-off loans, it appears that on average the defaulted loans are paying higher interest rates on their loans.

Default rates correlate negatively with Prosper Rating. Low rated loans default close to 40% of the time whereas top-rated (Prosper Rating = 7) loans default less than 15% of the time.

Credit score on its own also correlates negatively with default rates. Credit scores below 600 default more than 50% of the time. High (>850) credit scores default less than 5% of the time. Since credit score is a lot more granular, this value provides much more detail than the 7-category Prosper Rating score.

Estimated Loss also is also correlated with default rate but not as strongly as Credit Rating or credit scores.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Prosper Rating better predicts default rates as opposed to Prosper Score. Estimated losses overall appear to underestimate the actual principal losses due to defaults. Prosper seems to be getting better at predicting defaults compared to when they first started.

What was the strongest relationship you found?

The strongest relationships I found were twofold:

  1. Credit Score and Prosper Rating
  2. Credit Score and Loan Default %

Multivariate Plots Section

Date vs. Borrower APR (separated by Prosper Rating)

Interest rates change every day. So may the Prosper Rating. When we look at average APR across 2005 - 2014 loans, we miss the daily fluctuations in interest rate. If we track interest rates segregated by Prosper Rating, what can we learn?

This graph is interesting in two ways: 1. Prosper Rating’s influence on interest rates is obvious. Unlike the histograms in the univariate plot, this time we see a narrow band of interest rates for each Prosper Rating. 2. As time goes on, the interest rates become clearly tiered and separated (by Prosper Rating) especially compared to pre-2009 interest rates.

Is interest rate on average higher for loans that default versus those that do not?

On most days, loans that defaulted, on average, were assigned higher interest rates than those that did not. How about on a quarterly basis?

This graph is much easier to read than the daily view.

Except for Q2 2009, on a quarterly basis, interest rates appear around 5% higher for loans that eventually default versus loans that are paid back in full. 2009 was the year of the SEC shutdown and there were very few loans before the second half of the year. How many loans?

## # of Loans in Q2 2009: 13

There were only 13 loans in this quarter. This is an extremely small dataset for a whole quarter therefore we can ignore this datapoint as an outlier.

Of loans issued daily, how many default for different borrowers across different credit (Prosper) ratings?

There are some interesting findings here: 1. Generally higher ratings translate to fewer defaults. 2. In early quarters, low rated loans defaulted at extremely high rates (>60%)! This rate came down to below 40% in 2009 perhaps due to improved loan underwriting. 3. Default rates do overlap between the lowest rated (1-2) and medium rated (4-5) loans. However overall the ratings do tend to correctly predict relative default rates. 4. At the end of the graph, many of the lines do cross over. It will be interesting to see if the default rates change before the loans mature.

Does Prosper Score, which takes into account previous loan performance, provide a better or worse indicator of relative defaults?

This graph is incredibly hard ot look at. The default rates cross over each other frequently and it’s hard to identify the 11 colors on a single chart. Let’s try to break this up into higher scoring (>4) versus lower scoring (<=4) loans. First, let’s look at the lowest scoring loans:

Default rates vary wildly from quarter to quarter. Default rates swing from almost 45% to less than 30% in a single quarter for the lowest (1) scoring loans. In the 1st quarter of 2011, the 2nd lowest scoring tier loans did not default at all. This looks suspicious. How many loans exist in each Prosper Score bucket?

##     1     2     3     4     5     6     7     8     9    10    11  NA's 
##   847  1333  1813  3436  4374  5504  4475  6916  4068  2053     0 29084
##   1   2   3   4   5   6   7   8   9  10  11 
##   0   1   0 113 263 352 255 356 229 175   0
##   1   2   3   4   5   6   7   8   9  10  11 
## 135 113 136 353 352 392 278 386 200 133   0

Interesting… in the first quarter of 2011, there was only 1 loan in the lowest 3 prosper scores. Then in the second quarter, this number increased to 350+ loans. Overall, for the lower scores, it does not look like a strong negative correlation between Prosper Score and default rates. Does this look different for the higher scores?

This chart is more like what I was expecting. Loans which have been assigned the highest Prosper Score have very low (<5%) default rates THere is some overlap across the middle scores, but overall the trend is clear. Higher scores correlate with lower defaults.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Looking at a daily or quarterly snapshot of loans and interest rates (as opposed to all dates at once) reduces noise created by daily changes to interest rates, Prosper loan pricing algorithms, and improved risk detection procedures. There is less overlap of mean interest rates for borrowers of different prosper credit ratings.

Were there any interesting or surprising interactions between features?

Yes. I was surprised that borrower credit rating seems to be ‘better’ at predicting default rates than Prosper’s “Prosper Score” which is supposed to better approximate risk with additional analytics based on Prosper loan performance.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No I did not create a model for this dataset. I considered looking at a model for Borrower APR or Estimated Loss. However it is very difficult to identify the criteria used to determine either factor beyond the credit scores behind each loan.


Final Plots and Summary

Plot One

Description One

We drew a histogram earlier showing loan counts and aggregate loan amounts on a daily basis for the entire dataset. The boxplot here provides addition detail that adds new information.

  1. The loan limit seems to have been raised from $25,000 to $35,000 in 2013.
  2. Despite the top quartile of loans staying flat at $15,000 from 2013, the mean loan amount continues to rise through 2014.
  3. In the months leading to Prosper’s shutdown in 2009, the mean loan amount dropped (likely due to publicity around SEC investigation). After Prosper restarted their service, it took almost 2-years for the mean loan amount to recover.

Plot Two

Description Two

Looking at loan status across all dates (and across Borrower Rating) reveals some interesting facts:

  1. Loan prepayments are everywhere at all levels. It seems that no prosper borrowers are holding their loans for the full term. As a Prosper investor, I would be very concerned about this. As a lender, I evaluate each loan and choose to invest after becoming comfortable with the risks and returns. Every time a loan is prepaid before the term is complete, I need to find another loan with similar criteria (and risk) to keep receiving my targeted return. Every investment is a risk and this makes investing in Prosper loans more risky and more work.

  2. As witnessed in some of the above charts, lower rated borrowers defaulted at high rates early on. These default rates came down across all rating levels after the 2009 shutdown. Defaults are creeping up again at all borrower credit levels, but the increase is not as high as the total number of loans. Lower rated borrowers are defaulting at much higher rates than high rated borrowers. This is expected. However PRosper should keep an eye on the default rates and make sure they stay manageable.

Plot Three

Description Three

This is a scatterplot version of the Prosper Rate vs. Mean Borrower APR line graph that we plotted in the Multivariate section. The difference is hear we see all data points instead of just the mean interest rates. With the additional data comes some new findings.

  1. Similar to the mean APR graph above, we can see the interest rates become much more aligned with the Prosper Rating, especially after 2009. Towards late 2012, we see a clean band of interest rates marking each Prosper Rating tier.
  2. Early Prosper data shows a wide variety of interest rates especially for poorly rated borrowers. At the lowest tier (Rating = 1), borrowers are charged interest rates ranging from 15% to over 50%.
  3. There is a period from 2011 - mid-2012 where interest rates are not as dispersed as other periods on the chart. There is a lot of white space between data points. This white space almost disappears in mid 2012 when presumably Prosper readjusted how they set the interest rates on loans.

Reflection

I was excited to have the opportunity to look at real loan data from Prosper Market, a FinTech pioneer, who was one of the first US providers of peer-to-peer lending products.

This dataset is particularly interesting because the loan data contains three different types of information in each record:

Borrower details and loan details are setup when the loan is issued. This data is not updated afterwards. The loan status is updated during the life of the loan.

Using this information, I attempted to derive the “Default Ratio” of Prosper Loans and collect visualizations showing whether borrower credit rating or other information could predict higher or lower loan defaults. While focusing on older loans (more than 511 days old, reflecting the average life of a defaulted loan), I noticed a tight correlation between credit rating and loan default. This was noticeable with Prosper’s 7-tier rating system but was even more apparent when looking at borrower credit scores. Borrowers with the highest credit scores rarely defaulted on their loans while low credit score borrowers defaulted most of the time on their loans. Interest rates were also tied closely to credit scores which makes sense considering the relationship between credit score and loan defaults.

Initially I was looking forward to comparing Prosper’s Estimated Return against actual investor returns from loans in this dataset. Unfortunately I quickly ran into problems due to limited knowledge of this dataset. SOme of these problems were:

  1. Limited understanding of the fees that erode an investor’s return as borrowers pay back their loans.
  2. Limited information about recoveries. When does an investor receive their share of the recoveries? How much of the recovery will they receive?
  3. Prepayments end loans early. Again, when does an investor receive their share of the principal / interest payment.

Perhaps the actual return on investment from Prosper loans would be more easily extracted using a different kind of data. Pulling data from an Accounts Payable system that pays out the loan investors would be a more accurate way to identify the investor returns and compare them to the originating loans (and estimated returns). As Prosper improves their algorithms to better identify potential risks due to defaults and prepayments, they can more effectively communicate these risks to investors who will be happier when this product (Prosper Loans) behaves according to expectations with fewer surprises.

Cleanup